Joshua Koonce

Plant Seedling Image Classification

Convolutional Neural Networks Project

The objective of this project is to create a Convolutional Neural Network that can accurately identify seedlings presented to it in the form of image files.

The project will use a dataset from Kaggle that has been condensed down to a reasonable size for class purposes by the academic staff.

Since the data will be image data, there is no data dictionary for this project. The images and truth labels will be imported and processed.

Import the necessary libraries:

Check that GPU compute is enabled and that TensorFlow can see and use the GPU.
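A minimal version of this check, assuming TensorFlow 2.x:

```python
import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means CPU-only execution
gpus = tf.config.list_physical_devices('GPU')
print(f"Num GPUs available: {len(gpus)}")
```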

Load Data and Summary

Load the images from the provided files. For this project it's more practical to keep the data in NumPy arrays instead of pandas DataFrames, as the data is not tabular in nature.

We have 4750 color images with a resolution of 128x128.

Store the unique labels in the dataset.

Store the starting and ending index of the first image in each class for evaluation later.
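One way to grab the position of the first image in each class is `np.unique` with `return_index` (toy labels here stand in for the real truth labels):

```python
import numpy as np

# Toy label array standing in for the real truth labels
labels = np.array(['grass', 'grass', 'clover', 'clover', 'chickweed'])

# return_index gives the position of the first occurrence of each class
classes, first_idx = np.unique(labels, return_index=True)
print(dict(zip(classes, first_idx)))
```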

Exploratory Data Analysis

Class balance assessment:

The classes are certainly imbalanced, but there is a decent number of samples in each class. I will try to build a model without oversampling and, if that doesn't go well, try oversampling.
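Class counts can be tallied with `np.unique` (toy labels here; the real array has 4750 entries):

```python
import numpy as np

# Toy label array with a deliberate imbalance
labels = np.array(['grass'] * 5 + ['clover'] * 3 + ['chickweed'] * 2)

classes, counts = np.unique(labels, return_counts=True)
for name, n in zip(classes, counts):
    print(f"{name}: {n}")
```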

Create a function to print the first image from each class so that we can evaluate image processing as it proceeds.
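A sketch of such a helper using matplotlib (the function name and signature are my own, not necessarily the notebook's):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so this also works headless
import matplotlib.pyplot as plt
import numpy as np

def show_first_per_class(images, labels, class_names):
    """Plot the first image belonging to each class in a single row."""
    fig, axes = plt.subplots(1, len(class_names),
                             figsize=(2 * len(class_names), 2), squeeze=False)
    for ax, name in zip(axes[0], class_names):
        idx = np.flatnonzero(labels == name)[0]  # first image of this class
        ax.imshow(images[idx])
        ax.set_title(name, fontsize=8)
        ax.axis('off')
    return fig

# Toy data: four random 8x8 RGB "images" across two classes
imgs = np.random.rand(4, 8, 8, 3)
labs = np.array(['a', 'a', 'b', 'b'])
fig = show_first_per_class(imgs, labs, ['a', 'b'])
```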

Print an image from each label prior to processing to make sure the data loaded correctly.

Average Images per Class

The average images don't offer a whole lot of insight, although some are donut shaped (opposing leaves), others are solid green in the center, and in others (the grasses) you can barely see any green at all.
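The per-class averages come from a boolean mask over the image stack (toy arrays here for illustration):

```python
import numpy as np

# Four toy 4x4 RGB "images" in two classes
images = np.stack([np.full((4, 4, 3), v) for v in (0.0, 1.0, 0.25, 0.75)])
labels = np.array(['a', 'a', 'b', 'b'])

# Average image for class 'a': mean over the sample axis of the matching subset
avg_a = images[labels == 'a'].mean(axis=0)
print(avg_a[0, 0])  # each channel averages 0.0 and 1.0 to 0.5
```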

EDA Insights

Establish a Metrics Function

This function can be called on any model I create to display its accuracy and loss per epoch, along with a confusion matrix and the typical classification metrics.
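A sketch of the core of such a function using scikit-learn (the real version also plots per-epoch accuracy and loss from the Keras History object; the name `show_metrics` is my own):

```python
from sklearn.metrics import confusion_matrix, classification_report

def show_metrics(y_true, y_pred, class_names):
    """Print a confusion matrix plus per-class precision/recall/F1."""
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    print(classification_report(y_true, y_pred, target_names=class_names))
    return cm

# Toy predictions over three classes
cm = show_metrics([0, 1, 1, 2, 2], [0, 1, 2, 2, 2], ['a', 'b', 'c'])
```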

Baseline Model

First I'm going to run a simple baseline model on the unprocessed images with no class balancing, then run the same model on the processed images to see whether processing makes a difference in learning. I'll then attempt to oversample the minority classes if I'm not getting the results I want. Only then will I see if I can improve performance with more complexity.

To do that I'm going to have to binarize the labels here and then do a train/test split.
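A sketch with scikit-learn's LabelBinarizer and train_test_split (toy stand-in data; the stratification and seed here are my assumptions):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split

# Toy stand-ins: nine tiny "images" across three classes
images = np.zeros((9, 128, 128, 3), dtype='float32')
labels = np.repeat(['grass', 'clover', 'chickweed'], 3)

lb = LabelBinarizer()
y = lb.fit_transform(labels)  # one-hot: one column per class

X_train, X_test, y_train, y_test = train_test_split(
    images, y, test_size=0.33, stratify=labels, random_state=42)
print(X_train.shape, y.shape)
```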

Import necessary Keras functionality, set options for the models.

Options I'm keeping consistent throughout modeling:

Create a function to run any passed model, re-create the train/test split with the same seed, fit the model, and provide metrics and results:

Establish Macro-Level Hyperparameters

Function to Run Models

Define the Baseline Model

Build the baseline model, creating a function to return it so that it can be easily re-used without all the retyping of layers. This will just be used to test performance before and after data processing to see if it's benefiting the model.
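A sketch of what such a factory function can look like in Keras (the exact layer counts and widths here are assumptions, not the notebook's actual architecture):

```python
from tensorflow.keras import layers, models

def build_baseline(num_classes=12, input_shape=(128, 128, 3)):
    """Return a fresh, compiled copy of the simple baseline CNN."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_baseline()
```

Returning a new model from a function, rather than reusing one instance, guarantees each experiment starts from freshly initialized weights.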

Running the Baseline Model on Completely Unprocessed Images

So the model is overfitting, as can be seen from its training accuracy not translating as well to the validation and test sets, but it's performing decently considering the images are completely unprocessed and the model is not complex. Many classes are being misclassified at this stage.

I'll now do some pre-processing on the images and see how it impacts the model. I'll use the same model for an apples-to-apples comparison.

Image Pre-Processing and Improving the Model

Denoising Images (Gaussian Filtering)

CNNs usually work better on slightly blurred images because noise is smoothed out.
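The effect on a single noisy pixel, shown with scipy's `gaussian_filter` (the notebook may use `cv2.GaussianBlur`, which behaves the same way):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy "image": a single bright (noisy) pixel on a dark background
img = np.zeros((9, 9))
img[4, 4] = 1.0

blurred = gaussian_filter(img, sigma=1)

# The spike is smeared over its neighbourhood, so its peak value drops
# while the total intensity is preserved
print(img.max(), blurred.max())
```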

RGB to HSV

This will make it easier to filter out objects and background that we don't want. It makes non-plant colors stick out.
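The intuition, shown with the standard library's `colorsys` (OpenCV's `cvtColor` uses different channel scaling but the same mapping): pure green lands at a single, easily thresholded hue.

```python
import colorsys

# Pure green in RGB maps to hue 1/3 (120 degrees) in HSV,
# so plant pixels cluster in a narrow hue band
h, s, v = colorsys.rgb_to_hsv(0.0, 1.0, 0.0)
print(h, s, v)
```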

Average Images in HSV

To bring out a little more detail in the average images.

You can definitely see a little more variety and more defining features in the average images when they're converted to the HSV color space. Seedlings have definite circular patterns, some solid, some hollow, many with bright centers.

Masking Out Backgrounds

Masking out objects in the scene that we don't need will help isolate the plant structure. I do this via the HSV colors: essentially I'm just filtering out everything that isn't green.
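The thresholding step in plain numpy (the "green" bounds here are assumptions, expressed in OpenCV's 0-179 hue scale; `cv2.inRange` does the same thing in one call):

```python
import numpy as np

# Toy 2x2 HSV image (hue in [0, 179], as OpenCV stores it)
hsv = np.array([[[ 60, 200, 200], [ 10,  50,  50]],
                [[ 45, 180, 150], [170,  40, 240]]], dtype=np.uint8)

lower = np.array([35, 40, 40])    # assumed lower bound for green
upper = np.array([85, 255, 255])  # assumed upper bound for green

# A pixel is kept only if all three channels fall inside the bounds
mask = (np.all((hsv >= lower) & (hsv <= upper), axis=-1)
        .astype(np.uint8) * 255)
print(mask)
```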

Applying the Masks to the Original Denoised Images

Now that I have masks created, I'm going to apply them to the denoised images from above to get us down to what we're interested in: the plants themselves.
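Applying a mask is an elementwise operation (toy arrays here; `cv2.bitwise_and(img, img, mask=mask)` is the equivalent call):

```python
import numpy as np

img = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)   # toy denoised image
mask = np.array([[255, 0], [0, 255]], dtype=np.uint8)  # 255 = keep pixel

# Broadcast the mask over the channel axis and zero out the background
masked = img * (mask[..., None] // 255)
print(masked[0, 0], masked[0, 1])
```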

Normalization

Since pixels can only have values between 0 and 255, dividing by the maximum value normalizes the data by creating proportions that fall within 0-1. This can help the learning algorithm.
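In code this is a single division:

```python
import numpy as np

img = np.array([[0, 128, 255]], dtype=np.uint8)

# Cast first so the division isn't done in integer arithmetic
norm = img.astype('float32') / 255.0
print(norm)
```

The in-model equivalent is a Keras Rescaling layer, e.g. `tf.keras.layers.Rescaling(1./255)` as the first layer.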

Note that rather than normalizing the images this way, you can rescale them within the model itself. Keras has a Rescaling layer just for doing so.

Baseline Model on Processed Images

This will be compared to the baseline model on the unprocessed images to see how much predictive performance has improved just by processing the images into a different format.

The model is still misclassifying some images, but it has improved considerably over the unprocessed version. This bodes well for future steps.

Oversampling

An imbalanced class situation can make it difficult for the algorithms to differentiate between classes, so I'm going to oversample the dataset.

I didn't get good results with SMOTE on images, so instead I'm using a basic Random Oversampler to resample with replacement and get all the classes balanced.
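What the random oversampler does, sketched in plain numpy (imblearn's RandomOverSampler wraps essentially this logic; note it expects 2-D input, so images get flattened first and reshaped afterward):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(10).reshape(5, 2)          # 5 toy samples, already flattened
y = np.array(['a', 'a', 'a', 'b', 'b'])  # imbalanced: 3 vs 2

# Resample each class with replacement up to the majority class count
classes, counts = np.unique(y, return_counts=True)
target = counts.max()
keep = np.concatenate([rng.choice(np.flatnonzero(y == c), size=target)
                       for c in classes])
X_res, y_res = X[keep], y[keep]
print(np.unique(y_res, return_counts=True)[1])  # balanced counts
```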

Baseline Model on Oversampled Images

This will allow me to see what predictive performance improvement I've achieved simply by oversampling the imbalanced classes.

The baseline model is actually performing really well on the test set. There is some overfitting, but I think this is almost as good as it's going to get with this many images in the dataset.

Increasing the Model Complexity

Key changes in this model are a kernel size of 5 on the convolution layers and more dense neurons.
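Sketched as a variant of a baseline-style factory (the layer widths are my assumptions; only the kernel size of 5 and the larger dense layer reflect the text above):

```python
from tensorflow.keras import layers, models

def build_deeper(num_classes=12, input_shape=(128, 128, 3)):
    """Baseline variant: 5x5 convolution kernels and a wider dense layer."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 5, activation='relu'),   # kernel size 5
        layers.MaxPooling2D(),
        layers.Conv2D(64, 5, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),      # more dense neurons
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_deeper()
```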

Conclusion and Takeaways

Although the test/validation performance still lags a bit behind the training performance, the model is decently fit now. I suspect some of the inability to get higher performance is linked to the pared-down dataset we're using for this project.

Overall, this model provides roughly 86% accuracy/recall in identifying a seedling correctly.

It struggles to differentiate between classes 0 and 6 (Black Grass and Loose Silky Bent), most likely because both have very basic profiles and no strongly distinguishing characteristics. If the end goal is to identify seedlings vs. grasses, then this may not be a problem.

Other things I tried that were NOT successful include using ImageDataGenerator's options to rotate, zoom, and otherwise vary the images in order to create more avenues for the algorithm to pick up on. Gaussian blurring also did not seem to help the model, but I left it in due to project requirements.

I think a key takeaway here is the importance of properly preparing and processing images prior to building a model, particularly if you're getting poor results to begin with. The largest jumps in performance definitely came from image preparation and oversampling.